Last week we introduced some of the key motivations behind Environmental Statistics.
The course will cover a number of statistical ideas around the general theme of environmental data.
This week we will be looking at uncertainty and variability, and how we can measure these and incorporate them into our conclusions.
We will then look at a number of important features of environmental data — censoring, outliers and missing data.
We often talk about uncertainty and error as though they are interchangeable, but this is not quite correct.
Error is the difference between the measured value and the “true value” of the thing being measured.
Uncertainty is a quantification of the doubt about a measurement result: it characterises the spread of values that could reasonably be attributed to the quantity being measured.
Practically speaking, we make use of common statistical distributions to account for uncertainty.
A continuous random variable \(X\) follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) if its probability density function (pdf) is:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
We denote this as:
\[ X \sim \mathcal{N}(\mu, \sigma^2), ~\text{where} ~ -\infty < X < +\infty \]
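As a quick sanity check, the pdf formula above can be evaluated directly and compared with Python's standard library implementation (the values of \(\mu\), \(\sigma\) and \(x\) below are arbitrary illustrations):

```python
import math
from statistics import NormalDist

# Illustrative parameter values, not from any real data set
mu, sigma = 2.0, 0.5
x = 2.3

# The pdf written out exactly as in the formula above
pdf_formula = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# The same density from the standard library (sigma is the standard deviation)
pdf_stdlib = NormalDist(mu=mu, sigma=sigma).pdf(x)

print(round(pdf_formula, 6), round(pdf_stdlib, 6))
```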
Why can’t we just use normal distributions for all environmental data?
A random variable \(X\) follows a log-normal distribution if \(\ln(X)\) follows a normal distribution, i.e.
\[ Y = \ln(X) \sim \mathcal{N}(\mu, \sigma^2), \quad \text{where}~ X \in (0, +\infty) ~\text{and}~ Y \in (-\infty, +\infty) \]
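This definition gives us a direct way to simulate log-normal data: exponentiate normal draws. A small sketch (with illustrative values of \(\mu\) and \(\sigma\)) confirms that the log-transformed sample recovers the normal parameters:

```python
import math
import random
from statistics import mean, stdev

random.seed(1)
mu, sigma = 0.5, 0.8  # illustrative values only

# If ln(X) ~ N(mu, sigma^2), then X = exp(Z) with Z ~ N(mu, sigma^2)
x = [math.exp(random.gauss(mu, sigma)) for _ in range(50_000)]

# X is strictly positive, and ln(X) should look normal with the chosen parameters
log_x = [math.log(v) for v in x]
print(round(mean(log_x), 2), round(stdev(log_x), 2))
```

Note that every simulated value is positive, which is exactly why log-normal models suit concentrations and other non-negative environmental quantities.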
A random variable \(X\) follows an exponential distribution with rate parameter \(\lambda >0\) if its probability density function (pdf) is:
\[ f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \geq 0 \\ 0 & \text{for } x < 0 \end{cases} \]
\(\lambda\) describes the rate of events, i.e., the number of events per unit time or distance; the exponential distribution then models the waiting time between successive events, with mean waiting time \(1/\lambda\).
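A short simulation illustrates the mean-waiting-time relationship (the rate \(\lambda = 2\) is an arbitrary illustration, e.g. two events per unit time):

```python
import random
from statistics import mean

random.seed(2)
lam = 2.0  # illustrative rate: two events per unit time

# Draw exponential waiting times with rate lam
waits = [random.expovariate(lam) for _ in range(100_000)]

# The sample mean should be close to 1/lam = 0.5
print(round(mean(waits), 3))
```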
A discrete random variable \(X\) follows a Poisson distribution with rate parameter \(\lambda > 0\) if its probability mass function (PMF) is:
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, ~ k = 0, 1, \dots \]
We denote this as \(X \sim Po(\lambda)\), where \(\lambda\) describes the mean number of events per unit time or space.
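The Poisson and exponential distributions are two views of the same process: counting exponential waiting times that fit into one unit of time yields a Poisson count. A sketch of that connection (with an illustrative \(\lambda\)):

```python
import math
import random

random.seed(3)
lam = 3.0  # illustrative rate

def poisson_count(rate):
    """Count events in [0, 1) when inter-event times are Exponential(rate)."""
    t, k = 0.0, 0
    while True:
        t += random.expovariate(rate)
        if t >= 1.0:
            return k
        k += 1

counts = [poisson_count(lam) for _ in range(50_000)]

# Compare the simulated P(X = 2) with the Poisson pmf
p2_empirical = counts.count(2) / len(counts)
p2_pmf = lam**2 * math.exp(-lam) / math.factorial(2)
print(round(p2_empirical, 3), round(p2_pmf, 3))
```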
A discrete random variable \(X\) follows a binomial distribution with parameters \(n\) and \(p\) if:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \dots, n \]
We denote this as \(X \sim Bi(n, p)\) where:
\(n\) = number of independent trials
\(p\) = probability of success in each trial
\(k\) = number of successes observed
Survival studies: \(n\) animals, each with survival probability \(p\)
Detection/non-detection: \(n\) surveys, probability \(p\) of detecting species
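The detection/non-detection example above can be sketched numerically; the survey numbers here are hypothetical, purely to illustrate the pmf:

```python
import math
import random

n, p = 10, 0.3  # hypothetical: 10 surveys, 30% detection probability each

def binom_pmf(k, n, p):
    """Binomial pmf, written out exactly as in the formula above."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# The pmf sums to 1 over k = 0..n
total = sum(binom_pmf(k, n, p) for k in range(n + 1))
print(round(total, 10))

# Cross-check P(X = 3) against simulated detection/non-detection surveys
random.seed(4)
sims = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]
print(round(sims.count(3) / len(sims), 3), round(binom_pmf(3, n, p), 3))
```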
A discrete random variable \(X\) follows a negative binomial distribution with parameters \(r\) and \(p\) if:
\[ P(X = k) = \binom{k + r - 1}{k} p^r (1-p)^k, ~ k = 0, 1, \dots \]
We denote this as \(X \sim \mathrm{NegBi}(r, p)\), where \(k\) counts the number of failures observed before the \(r\)th success, \(r\) is the required number of successes, and \(p\) is the probability of success in each trial.
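One common parameterisation counts failures before the \(r\)th success, with per-trial success probability \(p\); a small simulation under that assumption (with illustrative \(r\) and \(p\)) can be checked against the pmf:

```python
import math
import random

random.seed(5)
r, p = 3, 0.4  # illustrative: wait for 3 successes, 40% success per trial

def failures_before_rth_success(r, p):
    """Run Bernoulli(p) trials and count failures until the r-th success."""
    successes, failures = 0, 0
    while successes < r:
        if random.random() < p:
            successes += 1
        else:
            failures += 1
    return failures

def negbin_pmf(k, r, p):
    """Negative binomial pmf: k failures before the r-th success."""
    return math.comb(k + r - 1, k) * p**r * (1 - p) ** k

sims = [failures_before_rth_success(r, p) for _ in range(50_000)]
print(round(sims.count(2) / len(sims), 3), round(negbin_pmf(2, r, p), 3))
```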
All bathing water sites in Scotland are classified by SEPA as “Excellent”, “Good”, “Sufficient” or “Poor” in terms of how much faecal bacteria (from sewage) they contain.
The minimum standard all beaches or bathing water must meet is “Sufficient”.
The sites are classified based on the 90th and 95th percentiles of samples taken over the four most recent bathing seasons.
On the classification maps, green indicates “Excellent”, blue indicates “Good” and red indicates “Sufficient”.
The classification system assumes that bacterial concentrations at each site follow a log-normal distribution.
If this assumption does not hold, the classifications would not be accurate.
Therefore, it is crucial that we regularly assess this assumption to ensure the safety of our bathing water.
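A minimal sketch of how a percentile-based classification could work under the log-normal assumption. Everything here is hypothetical: the simulated data, the threshold of 500, and the two-way “Sufficient”/“Poor” split are illustrative simplifications, not SEPA's actual standards.

```python
import math
import random
from statistics import NormalDist, mean, stdev

random.seed(6)
# Hypothetical bacterial concentrations over four bathing seasons
samples = [math.exp(random.gauss(4.5, 1.0)) for _ in range(80)]

# Under log-normality, the fitted 90th percentile is exp(mean + z_0.90 * sd)
# computed on the log scale
log_s = [math.log(v) for v in samples]
z90 = NormalDist().inv_cdf(0.90)
p90_lognormal = math.exp(mean(log_s) + stdev(log_s) * z90)

THRESHOLD = 500.0  # hypothetical limit, not a real SEPA standard
classification = "Sufficient" if p90_lognormal <= THRESHOLD else "Poor"
print(round(p90_lognormal, 1), classification)
```

If the log-normal assumption fails, the fitted percentile (and hence the classification) can be badly wrong, which is the point made above.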
We can use our standard residual plots to assess log-normality.
The top plots show the standard residuals and the bottom plots show the residuals for the log-transformed data.
There is no strong evidence to suggest we have breached our assumptions.
Error in a measurement is the difference between the measured value and the true value.
Random error: Variation observed randomly over repeat measurements.
→ With more measurements, these errors tend to average out, improving the precision of the mean.
Systematic error: A consistent bias that remains the same over repeated measurements.
For each example, identify whether the error is random or systematic:
A meter reads 0.01 even when measuring no sample.
→ Hint: Constant offset regardless of measurement…
An old thermometer can only measure to the nearest 0.5 degrees.
→ Hint: Precision limitation…
A poorly designed rainfall monitor often leaks water on windy days.
→ Hint: Specific condition causing consistent bias…
To estimate the abundance of a fish species in a lake, scientists use a net with a mesh size equal to the average fish length.
→ Hint: Only fishes up to a given size can be caught
Discuss with a neighbor!
Key takeaway: Random errors can be reduced by averaging; systematic errors require calibration, better instruments, or method changes.
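The takeaway can be demonstrated with a small simulation, using made-up numbers for the true value, the bias and the noise:

```python
import random
from statistics import mean

random.seed(8)
true_value = 10.0   # hypothetical true value of the quantity measured
bias = 0.3          # a constant systematic offset, e.g. a miscalibrated meter

# Each reading = truth + systematic bias + random noise
readings = [true_value + bias + random.gauss(0, 0.5) for _ in range(10_000)]

# Averaging many readings removes the random noise, but the bias remains
print(round(mean(readings) - true_value, 2))
```

The remaining error of the averaged estimate is close to the systematic bias, no matter how many readings are taken; only calibration or a better instrument can remove it.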